Importing the data

This course will give you some additional practice with importing and cleaning data through a series of four case studies.

You’ll be importing and cleaning four real datasets that are a little messier than before. Don’t worry – you’re up for the challenge!

Your first dataset describes online ticket sales for various events across the country. It’s stored as a Comma-Separated Value (CSV) file called sales.csv. Let’s jump right in!

# Import sales.csv: sales
sales <- read.csv("../xDatasets/sales.csv", stringsAsFactors = FALSE)

Examining the data

As you know from the Cleaning Data in R course, the first step when preparing to clean data is to inspect it. Let’s refresh your memory on some useful functions that can do that:

dim() returns the dimensions of an object
head() displays the first part of an object
names() returns the names associated with an object

# View dimensions of sales
dim(sales)
## [1] 5000   46
# Inspect first 6 rows of sales
sales %>% 
  head() %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
X event_id primary_act_id secondary_act_id purch_party_lkup_id event_name primary_act_name secondary_act_name major_cat_name minor_cat_name la_event_type_cat event_disp_name ticket_text tickets_purchased_qty trans_face_val_amt delivery_type_cd event_date_time event_dt presale_dt onsale_dt sales_ord_create_dttm sales_ord_tran_dt print_dt timezn_nm venue_city venue_state venue_postal_cd_sgmt_1 sales_platform_cd print_flg la_valid_tkt_event_flg fin_mkt_nm web_session_cookie_val gndr_cd age_yr income_amt edu_val edu_1st_indv_val edu_2nd_indv_val adults_in_hh_num married_ind child_present_ind home_owner_ind occpn_val occpn_1st_val occpn_2nd_val dist_to_ven
1 abcaf1adb99a935fc661 43f0436b905bfa7c2eec b85143bf51323b72e53c 7dfa56dd7d5956b17587 Xfinity Center Mansfield Premier Parking: Florida Georgia Line XFINITY Center Mansfield Premier Parking NULL MISC PARKING PARKING Xfinity Center Mansfield Premier Parking: Florida Georgia Line THIS TICKET IS VALID FOR PARKING ONLY GOOD THIS DAY ONLY PREMIER PARKING PASS XFINITY CENTER,LOTS 4 PM SAT SEP 12 2015 7:30 PM 1 45 eTicket 2015-09-12 23:30:00 2015-09-12 NULL 2015-05-15 2015-09-11 18:17:45 2015-09-11 2015-09-12 EST MANSFIELD MASSACHUSETTS 02048 www.concerts.livenation.com T N Boston 7dfa56dd7d5956b17587 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2 6c56d7f08c95f2aa453c 1a3e9aecd0617706a794 f53529c5679ea6ca5a48 4f9e6fc637eaf7b736c2 Gorge Camping - dave matthews band - sept 3-7 Gorge Camping Dave Matthews Band MISC CAMPING INVALID Gorge Camping - dave matthews band - sept 3-7 %OVERNIGHT C A M P I N G%* * * * * %GORGE CAMPGROUND% GOOD THIS DATE ONLY *%SEP 3 - 6, 2009 1 75 TicketFast 2009-09-05 01:00:00 2009-09-04 NULL 2009-03-13 2009-07-06 00:00:00 2009-07-05 2009-09-01 PST QUINCY WASHINGTON 98848 NULL T N Seattle 4f9e6fc637eaf7b736c2 NA NA NA NA NA NA NA NA NA NA NA NA NA 59
3 c7ab4524a121f9d687d2 4b677c3f5bec71eec8d1 b85143bf51323b72e53c 6c2545703bd527a7144d Dodge Theatre Adams Street Parking - benise Parking Event NULL MISC PARKING PARKING Dodge Theatre Adams Street Parking - benise ADAMS STREET GARAGE%PARKING FOR 4/21/06 ONLY%DODGE THEATRE PARKING PASS%ENTRANCE ON ADAMS STREET%BENISE%GARAGE OPENS AT 6:00PM 1 5 TicketFast 2006-04-22 01:30:00 2006-04-21 NULL 2006-02-25 2006-04-05 00:00:00 2006-04-05 2006-04-05 MST PHOENIX ARIZONA 85003 NULL T N Arizona 6c2545703bd527a7144d NA NA NA NA NA NA NA NA NA NA NA NA NA NA
4 394cb493f893be9b9ed1 b1ccea01ad6ef8522796 b85143bf51323b72e53c 527d6b1eaffc69ddd882 Gexa Energy Pavilion Vip Parking : kid rock with sheryl crow Gexa Energy Pavilion VIP Parking NULL MISC PARKING PARKING Gexa Energy Pavilion Vip Parking : kid rock with sheryl crow THIS TICKET IS VALID FOR PARKING ONLY GOOD FOR THIS DATE ONLY VIP PARKING PASS GEXA ENERGY PAVILION FRI SEP 02 2011 7:00 PM 1 20 Mail 2011-09-03 00:00:00 2011-09-02 NULL 2011-04-22 2011-07-01 17:38:50 2011-07-01 2011-07-06 CST DALLAS TEXAS 75210 NULL T N Dallas 527d6b1eaffc69ddd882 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
5 55b5f67e618557929f48 91c03a34b562436efa3c b85143bf51323b72e53c 8bd62c394a35213bdf52 Premier Parking - motley crue White River Amphitheatre Premier Parking NULL MISC PARKING PARKING Premier Parking - motley crue THIS TICKET IS VALID%FOR PARKING ONLY%GOOD THIS DATE ONLY%PREMIER PARKING PASS%WHITE RIVER AMPHITHEATRE%SAT JUL 30, 2005 6:00PM 1 20 Mail 2005-07-31 01:00:00 2005-07-30 2005-03-02 2005-03-04 2005-06-18 00:00:00 2005-06-18 2005-06-28 PST AUBURN WASHINGTON 98092 NULL T N Seattle 8bd62c394a35213bdf52 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
6 4f10fd8b9f550352bd56 ac4b847b3fde66f2117e 63814f3d63317f1b56c4 3b3a628f83135acd0676 Fast Lane Access: Journey Fast Lane Access Journey MISC SPECIAL ENTRY (UPSELL) UPSELL Fast Lane Access: Journey FAST LANE JOURNEY FAST LANE EVENT THIS IS NOT A TICKET SAN MANUEL AMPHITHEATER SAT JUL 21 2012 7:00 PM 2 10 TicketFast 2012-07-22 02:00:00 2012-07-21 NULL 2012-04-11 2012-07-21 17:20:18 2012-07-21 2012-07-21 PST SAN BERNARDINO CALIFORNIA 92407 www.livenation.com T N Los Angeles 3b3a628f83135acd0676 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# View column names of sales
names(sales)
##  [1] "X"                      "event_id"              
##  [3] "primary_act_id"         "secondary_act_id"      
##  [5] "purch_party_lkup_id"    "event_name"            
##  [7] "primary_act_name"       "secondary_act_name"    
##  [9] "major_cat_name"         "minor_cat_name"        
## [11] "la_event_type_cat"      "event_disp_name"       
## [13] "ticket_text"            "tickets_purchased_qty" 
## [15] "trans_face_val_amt"     "delivery_type_cd"      
## [17] "event_date_time"        "event_dt"              
## [19] "presale_dt"             "onsale_dt"             
## [21] "sales_ord_create_dttm"  "sales_ord_tran_dt"     
## [23] "print_dt"               "timezn_nm"             
## [25] "venue_city"             "venue_state"           
## [27] "venue_postal_cd_sgmt_1" "sales_platform_cd"     
## [29] "print_flg"              "la_valid_tkt_event_flg"
## [31] "fin_mkt_nm"             "web_session_cookie_val"
## [33] "gndr_cd"                "age_yr"                
## [35] "income_amt"             "edu_val"               
## [37] "edu_1st_indv_val"       "edu_2nd_indv_val"      
## [39] "adults_in_hh_num"       "married_ind"           
## [41] "child_present_ind"      "home_owner_ind"        
## [43] "occpn_val"              "occpn_1st_val"         
## [45] "occpn_2nd_val"          "dist_to_ven"

Notice how the rows appear to represent individual purchases and the columns contain different pieces of information about each purchase.

Summarizing the data

Luckily, the rows and columns appear to be arranged in a meaningful way: each row represents an observation and each column a variable, or piece of information about that observation.

In R, there are a great many tools at your disposal to help get a feel for your data. Besides the three you used in the previous exercise, the functions str() and summary() can be very helpful.

The dplyr package, introduced in Cleaning Data in R, offers the glimpse() function, which can also be used for this purpose.

# Look at structure of sales
str(sales, give.attr = FALSE)
## 'data.frame':    5000 obs. of  46 variables:
##  $ X                     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ event_id              : chr  "abcaf1adb99a935fc661" "6c56d7f08c95f2aa453c" "c7ab4524a121f9d687d2" "394cb493f893be9b9ed1" ...
##  $ primary_act_id        : chr  "43f0436b905bfa7c2eec" "1a3e9aecd0617706a794" "4b677c3f5bec71eec8d1" "b1ccea01ad6ef8522796" ...
##  $ secondary_act_id      : chr  "b85143bf51323b72e53c" "f53529c5679ea6ca5a48" "b85143bf51323b72e53c" "b85143bf51323b72e53c" ...
##  $ purch_party_lkup_id   : chr  "7dfa56dd7d5956b17587" "4f9e6fc637eaf7b736c2" "6c2545703bd527a7144d" "527d6b1eaffc69ddd882" ...
##  $ event_name            : chr  "Xfinity Center Mansfield Premier Parking: Florida Georgia Line" "Gorge Camping - dave matthews band - sept 3-7" "Dodge Theatre Adams Street Parking - benise" "Gexa Energy Pavilion Vip Parking : kid rock with sheryl crow" ...
##  $ primary_act_name      : chr  "XFINITY Center Mansfield Premier Parking" "Gorge Camping" "Parking Event" "Gexa Energy Pavilion VIP Parking" ...
##  $ secondary_act_name    : chr  "NULL" "Dave Matthews Band" "NULL" "NULL" ...
##  $ major_cat_name        : chr  "MISC" "MISC" "MISC" "MISC" ...
##  $ minor_cat_name        : chr  "PARKING" "CAMPING" "PARKING" "PARKING" ...
##  $ la_event_type_cat     : chr  "PARKING" "INVALID" "PARKING" "PARKING" ...
##  $ event_disp_name       : chr  "Xfinity Center Mansfield Premier Parking: Florida Georgia Line" "Gorge Camping - dave matthews band - sept 3-7" "Dodge Theatre Adams Street Parking - benise" "Gexa Energy Pavilion Vip Parking : kid rock with sheryl crow" ...
##  $ ticket_text           : chr  "   THIS TICKET IS VALID        FOR PARKING ONLY         GOOD THIS DAY ONLY       PREMIER PARKING PASS    XFINIT"| __truncated__ "%OVERNIGHT C A M P I N G%* * * * * *%GORGE CAMPGROUND%* GOOD THIS DATE ONLY *%SEP 3 - 6, 2009" "ADAMS STREET GARAGE%PARKING FOR 4/21/06 ONLY%DODGE THEATRE PARKING PASS%ENTRANCE ON ADAMS STREET%BENISE%GARAGE OPENS AT 6:00PM" "   THIS TICKET IS VALID        FOR PARKING ONLY      GOOD FOR THIS DATE ONLY       VIP PARKING PASS        GEXA"| __truncated__ ...
##  $ tickets_purchased_qty : int  1 1 1 1 1 2 1 1 1 1 ...
##  $ trans_face_val_amt    : num  45 75 5 20 20 10 30 28 20 25 ...
##  $ delivery_type_cd      : chr  "eTicket" "TicketFast" "TicketFast" "Mail" ...
##  $ event_date_time       : chr  "2015-09-12 23:30:00" "2009-09-05 01:00:00" "2006-04-22 01:30:00" "2011-09-03 00:00:00" ...
##  $ event_dt              : chr  "2015-09-12" "2009-09-04" "2006-04-21" "2011-09-02" ...
##  $ presale_dt            : chr  "NULL" "NULL" "NULL" "NULL" ...
##  $ onsale_dt             : chr  "2015-05-15" "2009-03-13" "2006-02-25" "2011-04-22" ...
##  $ sales_ord_create_dttm : chr  "2015-09-11 18:17:45" "2009-07-06 00:00:00" "2006-04-05 00:00:00" "2011-07-01 17:38:50" ...
##  $ sales_ord_tran_dt     : chr  "2015-09-11" "2009-07-05" "2006-04-05" "2011-07-01" ...
##  $ print_dt              : chr  "2015-09-12" "2009-09-01" "2006-04-05" "2011-07-06" ...
##  $ timezn_nm             : chr  "EST" "PST" "MST" "CST" ...
##  $ venue_city            : chr  "MANSFIELD" "QUINCY" "PHOENIX" "DALLAS" ...
##  $ venue_state           : chr  "MASSACHUSETTS" "WASHINGTON" "ARIZONA" "TEXAS" ...
##  $ venue_postal_cd_sgmt_1: chr  "02048" "98848" "85003" "75210" ...
##  $ sales_platform_cd     : chr  "www.concerts.livenation.com" "NULL" "NULL" "NULL" ...
##  $ print_flg             : chr  "T " "T " "T " "T " ...
##  $ la_valid_tkt_event_flg: chr  "N " "N " "N " "N " ...
##  $ fin_mkt_nm            : chr  "Boston" "Seattle" "Arizona" "Dallas" ...
##  $ web_session_cookie_val: chr  "7dfa56dd7d5956b17587" "4f9e6fc637eaf7b736c2" "6c2545703bd527a7144d" "527d6b1eaffc69ddd882" ...
##  $ gndr_cd               : chr  NA NA NA NA ...
##  $ age_yr                : chr  NA NA NA NA ...
##  $ income_amt            : chr  NA NA NA NA ...
##  $ edu_val               : chr  NA NA NA NA ...
##  $ edu_1st_indv_val      : chr  NA NA NA NA ...
##  $ edu_2nd_indv_val      : chr  NA NA NA NA ...
##  $ adults_in_hh_num      : chr  NA NA NA NA ...
##  $ married_ind           : chr  NA NA NA NA ...
##  $ child_present_ind     : chr  NA NA NA NA ...
##  $ home_owner_ind        : chr  NA NA NA NA ...
##  $ occpn_val             : chr  NA NA NA NA ...
##  $ occpn_1st_val         : chr  NA NA NA NA ...
##  $ occpn_2nd_val         : chr  NA NA NA NA ...
##  $ dist_to_ven           : int  NA 59 NA NA NA NA NA NA NA NA ...
# View a summary of sales
sum_sales <- as.data.frame(do.call(cbind, lapply(sales, summary)))
## Warning in (function (..., deparse.level = 1) : number of rows of result is
## not a multiple of vector length (arg 1)
sum_sales[,-1] %>% 
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
event_id primary_act_id secondary_act_id purch_party_lkup_id event_name primary_act_name secondary_act_name major_cat_name minor_cat_name la_event_type_cat event_disp_name ticket_text tickets_purchased_qty trans_face_val_amt delivery_type_cd event_date_time event_dt presale_dt onsale_dt sales_ord_create_dttm sales_ord_tran_dt print_dt timezn_nm venue_city venue_state venue_postal_cd_sgmt_1 sales_platform_cd print_flg la_valid_tkt_event_flg fin_mkt_nm web_session_cookie_val gndr_cd age_yr income_amt edu_val edu_1st_indv_val edu_2nd_indv_val adults_in_hh_num married_ind child_present_ind home_owner_ind occpn_val occpn_1st_val occpn_2nd_val dist_to_ven
Min. 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 1 1 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 0
1st Qu. character character character character character character character character character character character character 1 20 character character character character character character character character character character character character character character character character character character character character character character character character character character character character character character 12
Median character character character character character character character character character character character character 1 30 character character character character character character character character character character character character character character character character character character character character character character character character character character character character character character 26
Mean 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 1.639 77.077446 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 158.188854489164
3rd Qu. character character character character character character character character character character character character 2 85 character character character character character character character character character character character character character character character character character character character character character character character character character character character character character character 77.5
Max. character character character character character character character character character character character character 8 1520.88 character character character character character character character character character character character character character character character character character character character character character character character character character character character character character character 2548
NA’s 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 1 1 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000 4677
# Load dplyr
library(dplyr)

# Get a glimpse of sales
glimpse(sales)
## Observations: 5,000
## Variables: 46
## $ X                      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ...
## $ event_id               <chr> "abcaf1adb99a935fc661", "6c56d7f08c95f2...
## $ primary_act_id         <chr> "43f0436b905bfa7c2eec", "1a3e9aecd06177...
## $ secondary_act_id       <chr> "b85143bf51323b72e53c", "f53529c5679ea6...
## $ purch_party_lkup_id    <chr> "7dfa56dd7d5956b17587", "4f9e6fc637eaf7...
## $ event_name             <chr> "Xfinity Center Mansfield Premier Parki...
## $ primary_act_name       <chr> "XFINITY Center Mansfield Premier Parki...
## $ secondary_act_name     <chr> "NULL", "Dave Matthews Band", "NULL", "...
## $ major_cat_name         <chr> "MISC", "MISC", "MISC", "MISC", "MISC",...
## $ minor_cat_name         <chr> "PARKING", "CAMPING", "PARKING", "PARKI...
## $ la_event_type_cat      <chr> "PARKING", "INVALID", "PARKING", "PARKI...
## $ event_disp_name        <chr> "Xfinity Center Mansfield Premier Parki...
## $ ticket_text            <chr> "   THIS TICKET IS VALID        FOR PAR...
## $ tickets_purchased_qty  <int> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 4, ...
## $ trans_face_val_amt     <dbl> 45, 75, 5, 20, 20, 10, 30, 28, 20, 25, ...
## $ delivery_type_cd       <chr> "eTicket", "TicketFast", "TicketFast", ...
## $ event_date_time        <chr> "2015-09-12 23:30:00", "2009-09-05 01:0...
## $ event_dt               <chr> "2015-09-12", "2009-09-04", "2006-04-21...
## $ presale_dt             <chr> "NULL", "NULL", "NULL", "NULL", "2005-0...
## $ onsale_dt              <chr> "2015-05-15", "2009-03-13", "2006-02-25...
## $ sales_ord_create_dttm  <chr> "2015-09-11 18:17:45", "2009-07-06 00:0...
## $ sales_ord_tran_dt      <chr> "2015-09-11", "2009-07-05", "2006-04-05...
## $ print_dt               <chr> "2015-09-12", "2009-09-01", "2006-04-05...
## $ timezn_nm              <chr> "EST", "PST", "MST", "CST", "PST", "PST...
## $ venue_city             <chr> "MANSFIELD", "QUINCY", "PHOENIX", "DALL...
## $ venue_state            <chr> "MASSACHUSETTS", "WASHINGTON", "ARIZONA...
## $ venue_postal_cd_sgmt_1 <chr> "02048", "98848", "85003", "75210", "98...
## $ sales_platform_cd      <chr> "www.concerts.livenation.com", "NULL", ...
## $ print_flg              <chr> "T ", "T ", "T ", "T ", "T ", "T ", "T ...
## $ la_valid_tkt_event_flg <chr> "N ", "N ", "N ", "N ", "N ", "N ", "N ...
## $ fin_mkt_nm             <chr> "Boston", "Seattle", "Arizona", "Dallas...
## $ web_session_cookie_val <chr> "7dfa56dd7d5956b17587", "4f9e6fc637eaf7...
## $ gndr_cd                <chr> NA, NA, NA, NA, NA, NA, "M", NA, NA, NA...
## $ age_yr                 <chr> NA, NA, NA, NA, NA, NA, "28", NA, NA, N...
## $ income_amt             <chr> NA, NA, NA, NA, NA, NA, "112500", NA, N...
## $ edu_val                <chr> NA, NA, NA, NA, NA, NA, "High School", ...
## $ edu_1st_indv_val       <chr> NA, NA, NA, NA, NA, NA, "High School", ...
## $ edu_2nd_indv_val       <chr> NA, NA, NA, NA, NA, NA, "NULL", NA, NA,...
## $ adults_in_hh_num       <chr> NA, NA, NA, NA, NA, NA, "4", NA, NA, NA...
## $ married_ind            <chr> NA, NA, NA, NA, NA, NA, "0", NA, NA, NA...
## $ child_present_ind      <chr> NA, NA, NA, NA, NA, NA, "1", NA, NA, NA...
## $ home_owner_ind         <chr> NA, NA, NA, NA, NA, NA, "0", NA, NA, NA...
## $ occpn_val              <chr> NA, NA, NA, NA, NA, NA, "NULL", NA, NA,...
## $ occpn_1st_val          <chr> NA, NA, NA, NA, NA, NA, "Craftsman Blue...
## $ occpn_2nd_val          <chr> NA, NA, NA, NA, NA, NA, "NULL", NA, NA,...
## $ dist_to_ven            <int> NA, 59, NA, NA, NA, NA, NA, NA, NA, NA,...

Before moving on, scroll to the top of the glimpse() output. Notice the first column, X, which appears to just be counting.

Removing redundant info

You may have noticed that the first column of data is just a duplication of the row numbers. Not very useful. Go ahead and delete that column.

Remember that nrow() and ncol() return the number of rows and columns in a data frame, respectively.

Also, recall that you can use square brackets to subset a data frame as follows:

my_df[1:5, ] # First 5 rows of my_df
my_df[, 4] # Fourth column of my_df

Alternatively, you can remove rows and columns using negative indices. For example:

my_df[-(1:5), ] # Omit first 5 rows of my_df
my_df[, -4] # Omit fourth column of my_df

# Remove the first column of sales: sales2
sales2 <- sales[, -1]

Information not worth keeping

Many of the columns have information that’s of no use to us. For example, the first four columns contain internal codes representing particular events. The last fifteen columns also aren’t worth keeping; there are too many missing values to make them worthwhile.

An easy way to get rid of unnecessary columns is to create a vector containing the column indices you want to keep, then subset the data based on that vector using single bracket subsetting.

# Define a vector of column indices: keep
keep <- 5:(ncol(sales2) - 15)

# Subset sales2 using keep: sales3
sales3 <- sales2[, keep]

Separating columns

Some of the columns in your data frame include multiple pieces of information that should be in separate columns. In this exercise, you will separate such a column into two: one for date and one for time. You will use the separate() function from the tidyr package.

Take a look at the event_date_time column by typing head(sales3$event_date_time) in the console. You’ll notice that the date and time are separated by a space. Therefore, you’ll use sep = " " as an argument to separate().

# Load tidyr
library(tidyr)

# Split event_date_time: sales4
sales4 <- separate(sales3, event_date_time,
                   c("event_dt", "event_time"), sep = " ")

# Split sales_ord_create_dttm: sales5
sales5 <- separate(sales4, sales_ord_create_dttm,
                   c("ord_create_dt", "ord_create_time"), sep = " ")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 4 rows
## [2516, 3863, 4082, 4183].

Did you see the warning message that just popped up in the console? No need to panic (yet). You’ll sort it out in the next exercise.

Dealing with warnings

Looks like that second call to separate() threw a warning. Not to worry; warnings aren’t as bad as error messages. It’s not saying that the command didn’t execute; it’s just a heads-up that something unusual happened.

The warning says Too few values at 4 locations. You may be able to guess already what the issue is, but it’s still good to take a look.

The locations (i.e. rows) given in the warning are 2516, 3863, 4082, and 4183. Have a look at the contents of the sales_ord_create_dttm column in those rows.

# Define an issues vector
issues <- c(2516, 3863, 4082, 4183)

# Print values of sales_ord_create_dttm at these indices
sales3$sales_ord_create_dttm[issues]
## [1] "NULL" "NULL" "NULL" "NULL"
# Print a well-behaved value of sales_ord_create_dttm
sales3$sales_ord_create_dttm[2517]
## [1] "2013-08-04 23:07:19"

Thee warning was just because of four missing values. You’ll ignore them for now, but if your analysis depended on complete date/time information, you would probably need to delete those rows.

Identifying dates

Some of the columns in your dataset contain dates of different events. Right now, they are stored as character strings. That’s fine if all you want to do is look up the date associated with an event, but if you want to do any comparisons or math with the dates, it’s MUCH easier to store them as Date objects.

Luckily, all of the date columns in this dataset have the substring “dt” in their name, so you can use the str_detect() function of the stringr package to find the date columns. Then you can coerce them to Date objects using a function from the lubridate package.

You’ll use lapply() to apply the appropriate lubridate function to all of the columns that contain dates. Recall the following syntax for lapply() applied to some data frame columns of interest:

lapply(my_data_frame[, cols], function_name)

Also recall that function names in lubridate combine the letters y, m, d, h, m, and s depending on the format of the date/time string being read in.

# Load stringr
library(stringr)

# Find columns of sales5 containing "dt": date_cols
date_cols <- str_detect(names(sales5), "dt")

# Load lubridate
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
# Coerce date columns into Date objects
sales5[, date_cols] <- lapply(sales5[, date_cols], ymd)
## Warning: 2892 failed to parse.
## Warning: 101 failed to parse.
## Warning: 4 failed to parse.
## Warning: 424 failed to parse.

there were a few more warnings… Sigh.

More warnings!

As you saw, some of the calls to ymd() caused a failure to parse warning. That’s probably because of more missing data, but again, it’s good to check to be sure.

The first two lines of code (provided for you here) create a list of logical vectors called missing. Each vector in the list indicates the presence (or absence) of missing values in the corresponding column of sales5. See if the number of missing values in each column is the same as the number of rows that failed to parse in the previous exercise.

As a reminder, here are the warning messages:

Warning message: 2892 failed to parse.
Warning message: 101 failed to parse.
Warning message: 4 failed to parse.
Warning message: 424 failed to parse.

# Find date columns (don't change)
date_cols <- str_detect(names(sales5), "dt")

# Create logical vectors indicating missing values (don't change)
missing <- lapply(sales5[, date_cols], is.na)

# Create a numerical vector that counts missing values: num_missing
num_missing <- lapply(missing, sum)

# Print num_missing
num_missing
## $event_dt
## [1] 0
## 
## $presale_dt
## [1] 2892
## 
## $onsale_dt
## [1] 101
## 
## $ord_create_dt
## [1] 4
## 
## $sales_ord_tran_dt
## [1] 0
## 
## $print_dt
## [1] 424

Yep, it was missing data again. Ah, the joys of working with real-life datasets!

Combining columns

Sure enough, the number of NAs in each column match the numbers from the warning messages, so missing data is the culprit. How to proceed depends on your desired analysis. If you really need complete sets of date/time information, you might delete the rows or columns containing NAs.

As your last step, you’ll use the tidyr function unite() to combine the venue_city and venue_state columns into one column with the two values separated by a comma and a space. For example, “PORTLAND” “MAINE” should become “PORTLAND, MAINE”.

# Combine the venue_city and venue_state columns
sales6 <- unite(sales5, "venue_city_state", c("venue_city", "venue_state"), sep = ", ")

# View the head of sales6
sales6 %>% 
  head() %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689") %>%
  scroll_box(width = "100%", height = "500px")
event_name primary_act_name secondary_act_name major_cat_name minor_cat_name la_event_type_cat event_disp_name ticket_text tickets_purchased_qty trans_face_val_amt delivery_type_cd event_dt event_time presale_dt onsale_dt ord_create_dt ord_create_time sales_ord_tran_dt print_dt timezn_nm venue_city_state venue_postal_cd_sgmt_1 sales_platform_cd print_flg la_valid_tkt_event_flg fin_mkt_nm
Xfinity Center Mansfield Premier Parking: Florida Georgia Line XFINITY Center Mansfield Premier Parking NULL MISC PARKING PARKING Xfinity Center Mansfield Premier Parking: Florida Georgia Line THIS TICKET IS VALID FOR PARKING ONLY GOOD THIS DAY ONLY PREMIER PARKING PASS XFINITY CENTER,LOTS 4 PM SAT SEP 12 2015 7:30 PM 1 45 eTicket 2015-09-12 23:30:00 NA 2015-05-15 2015-09-11 18:17:45 2015-09-11 2015-09-12 EST MANSFIELD, MASSACHUSETTS 02048 www.concerts.livenation.com T N Boston
Gorge Camping - dave matthews band - sept 3-7 Gorge Camping Dave Matthews Band MISC CAMPING INVALID Gorge Camping - dave matthews band - sept 3-7 %OVERNIGHT C A M P I N G%* * * * * %GORGE CAMPGROUND% GOOD THIS DATE ONLY *%SEP 3 - 6, 2009 1 75 TicketFast 2009-09-05 01:00:00 NA 2009-03-13 2009-07-06 00:00:00 2009-07-05 2009-09-01 PST QUINCY, WASHINGTON 98848 NULL T N Seattle
Dodge Theatre Adams Street Parking - benise Parking Event NULL MISC PARKING PARKING Dodge Theatre Adams Street Parking - benise ADAMS STREET GARAGE%PARKING FOR 4/21/06 ONLY%DODGE THEATRE PARKING PASS%ENTRANCE ON ADAMS STREET%BENISE%GARAGE OPENS AT 6:00PM 1 5 TicketFast 2006-04-22 01:30:00 NA 2006-02-25 2006-04-05 00:00:00 2006-04-05 2006-04-05 MST PHOENIX, ARIZONA 85003 NULL T N Arizona
Gexa Energy Pavilion Vip Parking : kid rock with sheryl crow Gexa Energy Pavilion VIP Parking NULL MISC PARKING PARKING Gexa Energy Pavilion Vip Parking : kid rock with sheryl crow THIS TICKET IS VALID FOR PARKING ONLY GOOD FOR THIS DATE ONLY VIP PARKING PASS GEXA ENERGY PAVILION FRI SEP 02 2011 7:00 PM 1 20 Mail 2011-09-03 00:00:00 NA 2011-04-22 2011-07-01 17:38:50 2011-07-01 2011-07-06 CST DALLAS, TEXAS 75210 NULL T N Dallas
Premier Parking - motley crue White River Amphitheatre Premier Parking NULL MISC PARKING PARKING Premier Parking - motley crue THIS TICKET IS VALID%FOR PARKING ONLY%GOOD THIS DATE ONLY%PREMIER PARKING PASS%WHITE RIVER AMPHITHEATRE%SAT JUL 30, 2005 6:00PM 1 20 Mail 2005-07-31 01:00:00 2005-03-02 2005-03-04 2005-06-18 00:00:00 2005-06-18 2005-06-28 PST AUBURN, WASHINGTON 98092 NULL T N Seattle
Fast Lane Access: Journey Fast Lane Access Journey MISC SPECIAL ENTRY (UPSELL) UPSELL Fast Lane Access: Journey FAST LANE JOURNEY FAST LANE EVENT THIS IS NOT A TICKET SAN MANUEL AMPHITHEATER SAT JUL 21 2012 7:00 PM 2 10 TicketFast 2012-07-22 02:00:00 NA 2012-04-11 2012-07-21 17:20:18 2012-07-21 2012-07-21 PST SAN BERNARDINO, CALIFORNIA 92407 www.livenation.com T N Los Angeles

This dataset is much cleaner. Your next steps would depend on what specific analyses you wanted to perform; for now, we’ll call it a chapter. Next up, you’ll look at some data about “the T”, Boston’s public transit system.

# Save our tidy dataframe to csv file
write.csv(sales6,'../xDatasets/sales_clean.csv')